Concepedia

Concept: scalable computing

1.9K Publications
120K Citations
5.8K Authors
1.2K Institutions

Fault-Tolerant Data-Parallel Clusters

2002 - 2008

During 2002–2008, research consolidated scalable data processing across large clusters and grids by unifying data-parallel programming models, distributed scheduling, and data locality in map-reduce-style workflows and web-scale architectures. Reliability and observability matured into core design goals, supporting high-speed cluster monitoring, fault-tolerant execution, and highly available web services across wide-area networks. Work on inter-node communication and acceleration produced fast collective operations, NIC-based reductions, and scalable web server accelerators, while other groups defined quantitative scalability metrics and engineering practices for measuring and improving distributed systems.

Unified patterns for scalable data processing across large clusters and grids, integrating data-parallel programming models, distributed scheduling, and data locality in map-reduce–style workflows and web-scale architectures [1], [5], [6], [14], [16].
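The map-reduce pattern named above can be illustrated with a minimal, single-process Python sketch of word counting. It assumes a simplified model: input arrives as a list of splits, the map, shuffle, and reduce phases run sequentially in one process, and hash partitioning stands in for routing keys to reducer nodes. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and do not correspond to any specific framework's API.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator

Pair = tuple[str, int]

def map_phase(split: Iterable[str]) -> Iterator[Pair]:
    """Emit (word, 1) for every word in one input split."""
    for line in split:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs: Iterable[Pair], num_partitions: int) -> list[dict[str, list[int]]]:
    """Group intermediate values by key; hash partitioning picks a (simulated) reducer."""
    partitions: list[dict[str, list[int]]] = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions

def reduce_phase(partitions: list[dict[str, list[int]]],
                 reducer: Callable[[str, list[int]], int]) -> dict[str, int]:
    """Apply the reducer to each key's grouped values, partition by partition."""
    result: dict[str, int] = {}
    for partition in partitions:
        for key, values in partition.items():
            result[key] = reducer(key, values)
    return result

def word_count(splits: list[list[str]], num_partitions: int = 4) -> dict[str, int]:
    # In a cluster, each split would be mapped on the node that stores it (data locality);
    # here the splits are simply processed one after another.
    intermediate = (pair for split in splits for pair in map_phase(split))
    return reduce_phase(shuffle(intermediate, num_partitions), lambda _key, values: sum(values))

counts = word_count([["the quick fox", "the lazy dog"], ["the end"]])
```

Because every value for a given key lands in the same partition, each reducer can work independently, which is what lets the reduce phase spread across many nodes.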

Patterns of reliability and observability that enable scalable systems: high-speed cluster monitoring, fault-tolerant execution, wide-area monitoring, and highly available web services [3], [8], [9], [11], [13].
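Fault-tolerant execution in such systems commonly relies on re-running a failed task on another node. The sketch below assumes tasks are deterministic and side-effect free (so re-execution is safe), simulates a dead worker by raising an exception, and retries each task on a different worker in round-robin order. `WorkerFailure`, `run_with_reexecution`, and the worker names are hypothetical; this is not the mechanism of any specific cited system.

```python
from typing import Callable

class WorkerFailure(Exception):
    """Raised when a (simulated) worker dies while running a task."""

def run_with_reexecution(tasks: dict[str, int], workers: list[str],
                         run_on: Callable[[str, int], int],
                         max_attempts: int = 3) -> dict[str, int]:
    """Run every task, re-executing it on another worker whenever its worker fails."""
    results: dict[str, int] = {}
    for index, (task_id, payload) in enumerate(tasks.items()):
        for attempt in range(max_attempts):
            # Each attempt goes to the next worker in round-robin order,
            # so a failed node is not retried immediately.
            worker = workers[(index + attempt) % len(workers)]
            try:
                results[task_id] = run_on(worker, payload)
                break
            except WorkerFailure:
                continue
        else:
            raise RuntimeError(f"task {task_id} failed after {max_attempts} attempts")
    return results

# Hypothetical demo: worker "w1" is dead; every task squares its input.
dead_workers = {"w1"}

def square_on(worker: str, x: int) -> int:
    if worker in dead_workers:
        raise WorkerFailure(worker)
    return x * x

results = run_with_reexecution({"a": 2, "b": 3, "c": 4}, ["w0", "w1", "w2"], square_on)
```

Task "b" is first assigned to the dead worker "w1" and transparently completes on "w2"; retrying on the same node would only repeat the failure.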

Patterns optimizing inter-node communication and acceleration for large clusters, including fast collective operations, NIC-based reductions, and scalable web server accelerators [12], [15], [2], [13].
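Collective operations such as allreduce are often built from a recursive-doubling exchange pattern: with p ranks (p a power of two), in step k each rank swaps its partial result with the rank whose index differs in bit k, so every rank holds the full reduction after log2 p steps. The sketch simulates that pattern in one Python process. It is a standard textbook algorithm chosen for illustration, not the NIC-based design of the cited works, and the function name is hypothetical.

```python
def recursive_doubling_allreduce(values: list[float]) -> tuple[list[float], int]:
    """Simulate a sum-allreduce across len(values) ranks; return per-rank results and step count."""
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")
    buf = list(values)  # buf[r] is rank r's current partial sum
    distance, steps = 1, 0
    while distance < p:
        # Every rank exchanges with its partner r XOR distance at the same time,
        # so the new buffer is computed entirely from the previous step's values.
        buf = [buf[r] + buf[r ^ distance] for r in range(p)]
        distance <<= 1
        steps += 1
    return buf, steps

reduced, steps = recursive_doubling_allreduce([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

With 8 ranks, every rank ends with the total 36.0 after 3 exchange steps; the step count grows as log2 p rather than p, which is why logarithmic collectives matter on large clusters.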

Quantitative and engineering patterns for measuring, defining, and improving scalability, including scalability metrics, execution-time tradeoffs, software-engineering considerations, and malleability and migratability in distributed applications [18], [19], [17], [7].
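The text does not spell out which scalability metrics were proposed; a common starting point is speedup S(p) = T(1)/T(p) and parallel efficiency E(p) = S(p)/p, with Amdahl's law bounding speedup when only a fraction f of the work parallelizes. The sketch below computes these from wall-clock times. The timing values are invented for illustration and do not come from the cited studies.

```python
def speedup(t1: float, tp: float) -> float:
    """S(p) = T(1) / T(p): how many times faster the p-processor run is."""
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    """E(p) = S(p) / p: fraction of ideal linear speedup actually achieved."""
    return speedup(t1, tp) / p

def amdahl_speedup(parallel_fraction: float, p: int) -> float:
    """Amdahl's law: upper bound on speedup when only parallel_fraction of the work scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / p)

# Hypothetical wall-clock times in seconds, keyed by processor count (illustrative only).
times = {1: 100.0, 4: 30.0, 8: 18.0}
report = {p: (speedup(times[1], t), efficiency(times[1], t, p)) for p, t in times.items()}
bound_8 = amdahl_speedup(0.9, 8)
```

With these made-up numbers, efficiency falls from about 0.83 at 4 processors to about 0.69 at 8, the kind of execution-time tradeoff such metrics are meant to expose.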

Unified Cluster Resource Management

2009 - 2021